Diamond price prediction¶
Introduction¶
This project aims to develop a predictive model for estimating diamond price ranges using machine learning algorithms. By leveraging various attributes and available data, including carat weight, color grade, cut quality, clarity, and dimensions, the goal is to create an accurate model that enhances decision-making for sellers, buyers, and investors in the jewelry sector.
Domain understanding¶
Research Questions¶
What factors influence diamond pricing?
Historical Overview
Tracing the historical evolution of the diamond industry, from its early allure to the industrial revolution, the narrative encompasses transformative shifts, the South African diamond rush, and the impact of pricing strategies influenced by economic trends, geopolitical events, and technological advancements. Ethical considerations, especially concerning conflict diamonds, have reshaped industry norms.
Diamond Characteristics
The Gemological Institute of America's (GIA) "4 Cs" (Carat, Color, Clarity, Cut) are foundational for assessing a diamond's value. Detailed insight into each element, alongside additional factors like certification, dimensions, and table size, contributes to a holistic understanding of diamond quality and value.
Market Segmentation
The global diamond market, valued at USD 89.18 billion in 2019, is explored through segmentation based on type (natural and synthetic), applications (jewelry and industrial usage), and regional markets. The analysis delves into market trends, emphasizing the growing importance of ethical sourcing and the significant role of Millennials and Generation Z consumers.
Customer Analysis
Understanding diverse age groups, customer archetypes, and cultural perspectives provides a nuanced view of the diamond market. Insights into customer preferences, ranging from sentiment-driven choices to practical considerations, shape marketing strategies and pricing models.
Expert Insights
Drawing from insights gained in an interview with a Senegalese jewelry seller, a distinctive pattern emerges in the local market. Most Senegalese consumers exhibit a preference for gold over diamonds, citing challenges in assessing diamond worth and logistical complexities, often requiring imports from hubs like Dubai. This unique market dynamic positions diamonds not as everyday wear but rather as investments for future uncertainties or reserved for special occasions. The interviewee also mentioned that buyers prioritize the enduring value of diamonds, considering them a form of financial security. Cultural significance also plays a role, as diamonds are perceived as markers of important life events, aligning with global trends yet reflecting the practical considerations inherent in the local context.
Domain Understanding and Impact Analysis¶
Positive Impacts:
- Making Diamond Shopping Easier: For people not well-versed in diamond pricing, this A.I. makes diamond shopping less intimidating. It simplifies things and makes the market more accessible.
- Helpful Pricing Assistance: The A.I. model acts like a helpful friend in the diamond market, quickly giving price suggestions. This makes it easier for potential buyers to decide without feeling overwhelmed.
- Possibility of Saving Money: The A.I. might help buyers save money by guiding them to reasonable price ranges, preventing them from overpaying for diamonds.
Negative Impacts:
- Still Need Human Expertise: While the A.I. is great, it's no substitute for human experts. In tricky cases, it's essential to have a real person double-check things to make sure everything's spot-on.
- Not Perfect (but Close): The A.I. isn't 100% infallible. There's a little room for error (around 3%), so it's crucial to be upfront about this. It's more like a helpful assistant than a flawless expert.
- Understanding is Key: If users don't fully understand how to use the A.I. alongside their own judgment, there might be some confusion. Clear communication about its strengths and limits is essential.
Data sourcing¶
Define Objectives
The project's goal is to develop a predictive model for estimating diamond prices using various attributes and available data.
Data Characteristics
The dataset structure includes essential attributes such as carat weight, cut quality, color grade, clarity grade, and price.
Data Sources
Data primarily sourced from Kaggle ensures comprehensive information about diamonds, maintaining privacy and adhering to licensing terms. Ethical considerations are emphasized, limiting usage to academic and research purposes.
Data Diversity
Ensuring a diverse dataset with varied diamond attributes and price ranges is critical for the accuracy and fairness of the predictive model.
Version Control
Implementing version control, such as Git, is essential for maintaining the integrity and reproducibility of the analysis, ensuring a transparent record of data-related actions.
Analytic approach¶
Initially identified as a regression task, the project considered linear regression but later shifted to the Random Forest algorithm. This decision was motivated by the desire to harness ensemble learning for more accurate and robust predictions. Key features for predicting diamond prices include Y dimension, color grade, and clarity grade. Continuous efforts focus on refining the model and optimizing parameters for improved predictions.
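The shift to Random Forest can be illustrated with a minimal, hypothetical sketch: an ensemble of decision trees fit on synthetic regression data that merely stands in for the diamond features (none of the numbers below come from the real dataset).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diamond features: 9 predictors, one continuous target
X, y = make_regression(n_samples=1000, n_features=9, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An ensemble of 100 trees; each tree sees a bootstrap sample, predictions are averaged
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("R² on held-out data:", rf.score(X_test, y_test))
```

Averaging many decorrelated trees is what gives the ensemble its robustness relative to a single linear fit.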
import sklearn
import pandas
import seaborn
import matplotlib.pyplot as plt
print("scikit-learn version:", sklearn.__version__)
print("pandas version:", pandas.__version__)
print("seaborn version:", seaborn.__version__)
scikit-learn version: 1.3.0 pandas version: 2.1.0 seaborn version: 0.12.2
import warnings
warnings.filterwarnings('ignore')
Data provisioning¶
Data collection¶
Load the diamonds dataset.
df = pandas.read_csv("Diamonds Prices2022.csv")
df
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
| 53940 | 0.71 | Premium | E | SI1 | 60.5 | 55.0 | 2756 | 5.79 | 5.74 | 3.49 |
| 53941 | 0.71 | Premium | F | SI1 | 59.8 | 62.0 | 2756 | 5.74 | 5.73 | 3.43 |
| 53942 | 0.70 | Very Good | E | VS2 | 60.5 | 59.0 | 2757 | 5.71 | 5.76 | 3.47 |
53943 rows × 10 columns
Data Understanding¶
df.describe()
| carat | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|
| count | 53943.000000 | 53943.000000 | 53943.000000 | 53943.000000 | 53943.000000 | 53943.000000 | 53943.000000 |
| mean | 0.797935 | 61.749322 | 57.457251 | 3932.734294 | 5.731158 | 5.734526 | 3.538730 |
| std | 0.473999 | 1.432626 | 2.234549 | 3989.338447 | 1.121730 | 1.142103 | 0.705679 |
| min | 0.200000 | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.400000 | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 1.040000 | 62.500000 | 59.000000 | 5324.000000 | 6.540000 | 6.540000 | 4.040000 |
| max | 5.010000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
Carat: The average carat weight is approximately 0.80, with a range from 0.20 to 5.01. Most diamonds (75%) have a weight of 1.04 carats or less.
Depth: The average depth percentage is around 61.75%. The depth ranges from 43% to 79%, with 75% of diamonds having a depth of 62.5% or less.
Table: The average table percentage is approximately 57.46%. The table width varies from 43% to 95%, with 75% of diamonds having a table width of 59% or less.
Price: The average price is about 3932.73, with a wide range from 326 to 18823. Most diamonds (75%) have a price of 5324 or less.
X, Y, Z (Dimensions): The average dimensions for length (X), width (Y), and depth (Z) are approximately 5.73 mm, 5.73 mm, and 3.54 mm, respectively. Additionally, the presence of zero values in these dimensions is unexpected and may indicate data entry errors or missing values.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 53943 entries, 0 to 53942 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 carat 53943 non-null float64 1 cut 53943 non-null object 2 color 53943 non-null object 3 clarity 53943 non-null object 4 depth 53943 non-null float64 5 table 53943 non-null float64 6 price 53943 non-null int64 7 x 53943 non-null float64 8 y 53943 non-null float64 9 z 53943 non-null float64 dtypes: float64(6), int64(1), object(3) memory usage: 4.1+ MB
Exploratory Data Analysis¶
Feature distributions¶
Price¶
We explore the distribution of diamond prices using histograms with varying bin widths. These histograms offer different levels of detail, allowing us to analyze the data from both a granular and generalized perspective. Let's observe how the choice of bin width influences our understanding of the price distribution.
# Define the binwidths
binwidths = [300, 5000, 7000]
plt.figure(figsize=(14, 4))
for i, binwidth in enumerate(binwidths, 1):
plt.subplot(1, 3, i)
plt.hist(df['price'], bins=range(0, df['price'].max() + binwidth, binwidth), color='gold', edgecolor='black')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.title(f'Histogram of Price (Binwidth: {binwidth})')
plt.tight_layout()
plt.show()
Smaller binwidths provide a granular view, revealing detailed patterns in the data. Larger binwidths, on the other hand, offer a more generalized perspective, highlighting broader trends. Notably, there are no outliers, and the histograms cover the entire price range. Prices are unevenly distributed: as the price goes up, the frequency of diamonds at those price points decreases.
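The long right tail visible in the histograms can be quantified with a skewness statistic. A small sketch on hypothetical prices (with the real data one would simply call `df['price'].skew()`):

```python
import pandas as pd

# Hypothetical price sample with a long right tail, mimicking the histogram's shape
price = pd.Series([326, 400, 550, 900, 1500, 4000, 18000])

# A positive skew confirms most mass sits at lower prices with a stretched upper tail
print("skewness:", price.skew())
```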
Diamond's depth¶
Depth, represented as a percentage, signifies the height of a diamond's top facet relative to its girdle diameter. By examining the distribution of depth values, we aim to uncover patterns, outliers, and potential insights into the manufacturing of diamonds.
# Create a figure with two subplots side by side
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
seaborn.histplot(df['depth'], bins=100, color='deepskyblue', edgecolor='black')
plt.xlabel('Depth')
plt.ylabel('Frequency')
plt.title('Histogram of Depth of Diamonds')
# Zoomed-in Distribution Plot
plt.subplot(1, 2, 2)
seaborn.histplot(df['depth'], bins=100, color='deepskyblue', edgecolor='black')
plt.xlabel('Depth')
plt.ylabel('Frequency')
plt.title('Zoomed-in Histogram of Depth of Diamonds')
plt.ylim(0, 40) # Setting the y-axis limit for the zoomed-in plot
plt.tight_layout() # Ensures proper spacing between subplots
plt.show()
The first histogram provides an overview of the distribution of depth values in our dataset. By dividing the depth values into 100 bins and plotting their frequencies, we can observe the general pattern. In this graph, we notice that most diamonds have depth values concentrated within a specific range, which may suggest an industry preference regarding diamond depth. For a more detailed analysis, the second graph zooms in on a subset of the data to examine the depth values at finer granularity. By narrowing our focus, we can identify potential outliers or unusual patterns that are not apparent in the overall distribution. In the second graph, we can discern a central tendency in diamond depth in the range between 50 and 75, indicating a standard, expected pattern in the industry.
However, closer inspection also reveals outliers outside this central range. Diamonds with unusually high or low depth values might have unique characteristics or reflect data entry mistakes.
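One standard way to make this outlier hunt concrete is the 1.5×IQR rule. A minimal sketch on hypothetical depth values (the numbers are illustrative, not drawn from the dataset):

```python
import pandas as pd

# Toy depth values with two obvious extremes (hypothetical numbers)
depth = pd.Series([60.1, 61.5, 62.0, 61.8, 59.9, 62.3, 43.0, 79.0])

# Classic fence: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
q1, q3 = depth.quantile(0.25), depth.quantile(0.75)
iqr = q3 - q1
outliers = depth[(depth < q1 - 1.5 * iqr) | (depth > q3 + 1.5 * iqr)]
print(outliers.tolist())  # the two extremes are flagged
```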
Diamond's table¶
Next we will explore the 'table' values, a fundamental characteristic of diamonds. This data will provide insights into the width of the diamond's table in relation to its widest point. Examining the distribution will help us identify patterns and irregularities within our dataset.
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
seaborn.histplot(df['table'], bins=100, color='lightpink', edgecolor='black')
plt.xlabel('table')
plt.ylabel('Frequency')
plt.title('Histogram of table of Diamonds')
plt.subplot(1, 2, 2)
seaborn.histplot(df['table'], bins=100, color='lightpink', edgecolor='black')
plt.xlabel('table')
plt.ylabel('Frequency')
plt.title('Zoomed-in Histogram of table of Diamonds')
plt.ylim(0, 40)
plt.tight_layout()
plt.show()
The first graph shows the spread of table percentages in our dataset. Most diamonds fall within a certain range, indicating industry preferences.
The second graph zooms in on table percentages between 45 and 80. Here, we see that many diamonds cluster around a common value, representing a standard industry practice.
However, there are outliers outside this range. These diamonds have unusually high or low table percentages, suggesting unique qualities or potential errors in the data. Understanding these exceptions is important for a clear picture of the dataset.
Distribution of carat¶
Carat, a fundamental measure of a diamond's weight, profoundly influences its value. Exploring the distribution provides valuable insights into weight patterns. Let's delve into these trends for a comprehensive analysis.
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
seaborn.histplot(df['carat'], bins=100, color='teal', edgecolor='black')
plt.xlabel('Carat')
plt.ylabel('Frequency')
plt.title('Histogram of Carat of Diamonds')
plt.subplot(1, 2, 2)
seaborn.histplot(df['carat'], bins=100, color='teal', edgecolor='black')
plt.xlabel('Carat')
plt.ylabel('Frequency')
plt.title('Zoomed-in Histogram of Carat of Diamonds')
plt.ylim(0, 40)
plt.tight_layout()
plt.show()
The zoomed-in graph for carat distribution highlights interesting insights. Initially, there's a dense cluster of diamonds with carat less than 3, indicating a popular choice among buyers. However, beyond this range, there's a noticeable decrease in the number of diamonds, creating a gap. While outliers might not be visible, this pattern suggests a market preference for smaller, well-crafted diamonds, with fewer choices in the larger carat range
For a deeper analysis, let's look at diamonds with carat < 3.
plt.figure(figsize=(8, 6))
smaller_diamonds = df[df['carat'] < 3]
seaborn.histplot(smaller_diamonds['carat'], bins=75, color='teal', edgecolor='black')
plt.xlabel('Carat')
plt.ylabel('Frequency')
plt.title('Histogram of Carat for Smaller Diamonds')
plt.show()
# Count the number of diamonds with carat value 0.99 and 1
count_099_carat = len(df[df['carat'] == 0.99])
count_1_carat = len(df[df['carat'] == 1])
print(f'Number of diamonds with 0.99 carat: {count_099_carat}')
print(f'Number of diamonds with 1 carat: {count_1_carat}')
Number of diamonds with 0.99 carat: 23 Number of diamonds with 1 carat: 1558
Exploring diamonds with carat values below 3 offers intriguing observations. The histogram illustrates distinct peaks at whole carat values (e.g., 1.0, 2.0), indicating a clear preference for these specific weights. However, between these whole carat values, there is a decline in frequency, suggesting that diamonds with weights slightly lower or higher than these standard values are less common. This pattern indicates a market preference for precisely measured carat weights.
Diamonds measurement y¶
y: width in mm
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
seaborn.histplot(df['y'], bins=100, color='limegreen', edgecolor='black')
plt.xlabel('Dimension y')
plt.ylabel('Frequency')
plt.title('Histogram of Dimension y of Diamonds')
plt.subplot(1, 2, 2)
seaborn.histplot(df['y'], bins=100, color='limegreen', edgecolor='black')
plt.xlabel('Dimension y')
plt.ylabel('Frequency')
plt.title('Zoomed-in Histogram of Dimension y of Diamonds')
plt.ylim(0, 100)
plt.tight_layout()
plt.show()
The 'y' dimension of the diamonds mainly ranges from around 4 to 10. However, there are diamonds with a value of 0, which seems unrealistic for a physical dimension. Additionally, there are outliers with values exceeding 30, which could be data entry errors or exceptional cases that need further investigation.
Diamonds measurement x¶
x: length in mm
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
seaborn.histplot(df['x'], bins=100, color='deepskyblue', edgecolor='black')
plt.xlabel('Dimension x')
plt.ylabel('Frequency')
plt.title('Histogram of Dimension x of Diamonds')
plt.subplot(1, 2, 2)
seaborn.histplot(df['x'], bins=100, color='deepskyblue', edgecolor='black')
plt.xlabel('Dimension x')
plt.ylabel('Frequency')
plt.title('Zoomed-in Histogram of Dimension x of Diamonds')
plt.ylim(0, 100)
plt.tight_layout()
plt.show()
The 'x' dimension of the diamonds also primarily falls within the range of approximately 4 to 10. Similar to 'y', there are diamonds with a value of 0, which requires scrutiny. Outliers could potentially distort the measurement data.
Diamonds measurement z¶
z: depth in mm
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
seaborn.histplot(df['z'], bins=100, color='darkorange', edgecolor='black')
plt.xlabel('Dimension z')
plt.ylabel('Frequency')
plt.title('Histogram of Dimension z of Diamonds')
plt.subplot(1, 2, 2)
seaborn.histplot(df['z'], bins=100, color='darkorange', edgecolor='black')
plt.xlabel('Dimension z')
plt.ylabel('Frequency')
plt.title('Zoomed-in Histogram of Dimension z of Diamonds')
plt.ylim(0, 100) # Set the y-axis limits
plt.tight_layout()
plt.show()
The 'z' dimension of the diamonds primarily ranges from around 0 to 10. Similar to 'y' and 'x', there are diamonds with a value of 0 and outliers extending beyond 10. It's important to examine these cases to understand if they are valid measurements or anomalies in the dataset.
Cut, clarity and color¶
# Set up subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 6))
# Distribution of Cut
seaborn.countplot(data=df, x='cut', palette='Set1', ax=axes[0])
axes[0].set_title('Distribution of Cut')
axes[0].set_xlabel('Cut')
axes[0].set_ylabel('Count')
# Distribution of Color
seaborn.countplot(data=df, x='color', palette='Set2', ax=axes[1])
axes[1].set_title('Distribution of Color')
axes[1].set_xlabel('Color')
axes[1].set_ylabel('Count')
# Distribution of Clarity
seaborn.countplot(data=df, x='clarity', palette='Set3', ax=axes[2])
axes[2].set_title('Distribution of Clarity')
axes[2].set_xlabel('Clarity')
axes[2].set_ylabel('Count')
plt.tight_layout()
plt.show()
The dataset classifies diamonds into five cut styles: 'Ideal', 'Premium', 'Very Good', 'Good', and 'Fair'. The majority have an 'Ideal' cut, representing high quality and sparkle. 'Premium' and 'Very Good' cuts are also common, offering good quality within a moderate price range. 'Good' and 'Fair' cuts are budget-friendly choices. This distribution reveals a preference for sparkling diamonds, with options suited to various budgets.
Diamonds are graded from 'D' (colorless) to 'J' (slightly tinted). Most diamonds in the dataset are colorless to near-colorless ('D' to 'G'), offering a mix of quality and value. The range also includes slightly tinted diamonds ('H' to 'J'), providing affordable options. However, there are more diamonds in the 'H' grade (slightly tinted) than in the 'D' grade (colorless).
Finally, looking at clarity, most diamonds in the dataset have slight inclusions ('SI1' and 'VS2'). 'SI2' diamonds, which also have some inclusions, are quite common too. Diamonds with very few inclusions ('VVS1' and 'VVS2') are rarer in comparison. There are also diamonds with visible imperfections ('I1'), though they are less common. The dataset includes very few flawless diamonds ('IF').
Feature-price relationships¶
Numerical features vs. Price (Scatter plots)¶
In this segment of the exploratory analysis, our primary goal is to unravel the nuances of diamond pricing by investigating the influence of carat weight, depth, table, and dimensional attributes (x, y, z) on diamond prices. Through this exploration, we aim to gain valuable insights into how these key features impact the overall pricing of diamonds.
# Numerical features vs. Price (Scatter plots)
plt.figure(figsize=(18, 12))
plt.subplot(2, 3, 1)
seaborn.scatterplot(x='carat', y='price', data=df, color='blue', alpha=0.5)
plt.title('Carat vs. Price')
plt.subplot(2, 3, 2)
seaborn.scatterplot(x='depth', y='price', data=df, color='green', alpha=0.5)
plt.title('Depth vs. Price')
plt.subplot(2, 3, 3)
seaborn.scatterplot(x='table', y='price', data=df, color='orange', alpha=0.5)
plt.title('Table vs. Price')
plt.subplot(2, 3, 4)
seaborn.scatterplot(x='x', y='price', data=df, color='red', alpha=0.5)
plt.title('X vs. Price')
plt.subplot(2, 3, 5)
seaborn.scatterplot(x='y', y='price', data=df, color='purple', alpha=0.5)
plt.title('Y vs. Price')
plt.subplot(2, 3, 6)
seaborn.scatterplot(x='z', y='price', data=df, color='brown', alpha=0.5)
plt.title('Z vs. Price')
plt.tight_layout()
plt.show()
Depth and Table: These dimensions exhibit no clear correlation with prices. The absence of a distinct trend suggests a limited impact on pricing decisions. Our strategy involves investigating outliers for potential pricing anomalies.
Dimension (X): The positive correlation indicates that increased length (x) corresponds to higher prices. However, the presence of zero values requires careful scrutiny. Our strategy involves addressing these zeros to ensure accurate pricing analysis.
Dimensions (Y and Z): While showing a positive relationship with prices, it's not as pronounced as x. Outliers and values near zero necessitate attention, signaling potential data discrepancies. Our strategy involves a thorough examination of outliers and addressing zero values for precise analysis.
Let's have a closer look at the relationship between carat and price.
plt.figure(figsize=(10, 6))
plt.scatter(df['carat'], df['price'], color='blue', alpha=0.3)
plt.xlabel('Carat')
plt.ylabel('Price')
plt.title('Scatter Plot of Carat vs Price')
plt.show()
sampled_data = df.sample(n=5000)
plt.figure(figsize=(10, 6))
plt.scatter(sampled_data['carat'], sampled_data['price'], color='blue', alpha=0.3)
plt.xlabel('Carat')
plt.ylabel('Price')
plt.title('Scatter Plot of Carat vs Price (Sample of 5,000)')
plt.show()
The scatter plots visually depict the relationship between carat weight and price for diamonds in our dataset. In the first plot, we observe a general trend: as carat weight increases, so does the price. This positive correlation suggests that larger diamonds tend to be more expensive.
In the second plot, where we sampled 5,000 data points, a striking pattern emerges. Diamonds with carat weights at whole or half values, such as 1, 1.5, 2, etc., are noticeably concentrated. This concentration indicates that these specific carat weights are particularly significant in the market. Buyers and sellers often prefer these standard weights, leading to a cluster of data points around these values. These dense clusters highlight the market's preference for diamonds with standard weights, influencing their pricing and demand.
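The clustering at standard sizes can be checked numerically by flagging weights within roughly 0.01 ct of a whole or half carat. A sketch on hypothetical carat values (with the real data one would pass `df['carat']`):

```python
import pandas as pd

# Hypothetical carat values; the real series would be df['carat']
carat = pd.Series([0.99, 1.00, 1.00, 1.01, 1.50, 1.52, 2.00, 0.73])

# Distance to the nearest whole or half carat; a small tolerance absorbs float noise
nearest_half = (carat * 2).round() / 2
at_magic_size = (carat - nearest_half).abs() <= 0.015
print("share at whole/half-carat weights:", at_magic_size.mean())
```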
Categorical features vs. Price¶
In this section, our focus shifts to understanding the pricing dynamics associated with categorical features – cut, color, and clarity. We will delve into the distribution of prices across different cut qualities, color grades, and clarity levels. The objective is to unravel patterns and variations that might offer valuable insights into how these categorical features contribute to the overall pricing landscape of diamonds.
# Categorical features vs. Price (Box plots)
plt.figure(figsize=(18, 6))
plt.subplot(1, 3, 1)
seaborn.boxplot(x='cut', y='price', data=df, palette='Set1')
plt.title('Cut vs. Price')
plt.subplot(1, 3, 2)
seaborn.boxplot(x='color', y='price', data=df, palette='Set1')
plt.title('Color vs. Price')
plt.subplot(1, 3, 3)
seaborn.boxplot(x='clarity', y='price', data=df, palette='Set1')
plt.title('Clarity vs. Price')
plt.tight_layout()
plt.show()
Examining the diamond cut categories—Ideal, Premium, Good, Very Good, and Fair—revealed intriguing pricing variations. Within these categories, maximum prices ranged from 10,000 to 14,500, while minimum prices were consistently around 400. Median prices fluctuated, with the highest seen in the Good category at 3,000. Outliers were prevalent across all categories, suggesting that cut quality alone might not be the sole determinant of diamond prices. This prompts a deeper exploration of the intricate interplay between cut quality and other diamond attributes for a more refined pricing strategy.
Examining the clarity feature's influence on diamond prices revealed considerable variation across categories. The maximum price range spanned from 3,000 to 18,000, showcasing distinct price tiers. Notably, the highest median was associated with clarity grade SI1, reaching around 4,500. All clarity categories exhibited outliers, suggesting a nuanced approach in pricing strategies to account for the varying impact of clarity on diamond prices. The consistent presence of outliers emphasizes the need to carefully consider clarity alongside other attributes in pricing decisions.
The exploration of the color feature's impact on diamond prices uncovered diverse pricing patterns. Maximum prices reached approximately 16,250, while minimum prices stayed at a consistent 400 across all color grades. Interestingly, the highest median was associated with color grade J, reaching around 4,500. Notably, all color categories exhibited outliers, indicating potential pricing variations within each grade. The presence of outliers underscores the importance of considering color as a crucial factor in diamond pricing decisions, with variations in pricing strategies required for different color grades.
Data Preparation¶
Let's get a random sample of 10 observations from the data.
df.sample(10)
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 9677 | 1.52 | Fair | H | I1 | 65.4 | 62.0 | 4648 | 7.10 | 7.02 | 4.62 |
| 4867 | 1.00 | Premium | G | SI2 | 61.4 | 61.0 | 3713 | 6.35 | 6.32 | 3.89 |
| 30725 | 0.41 | Good | D | SI1 | 63.2 | 58.0 | 738 | 4.72 | 4.77 | 3.00 |
| 26344 | 0.32 | Ideal | G | VS1 | 61.8 | 55.0 | 645 | 4.42 | 4.45 | 2.74 |
| 3132 | 0.79 | Ideal | D | SI1 | 61.3 | 54.0 | 3328 | 5.96 | 6.01 | 3.67 |
| 44811 | 0.50 | Ideal | E | VS2 | 61.4 | 57.0 | 1624 | 5.08 | 5.11 | 3.13 |
| 16565 | 1.01 | Premium | G | VS1 | 62.8 | 59.0 | 6618 | 6.37 | 6.34 | 3.99 |
| 49627 | 0.30 | Premium | E | SI2 | 61.9 | 58.0 | 540 | 4.31 | 4.28 | 2.66 |
| 1276 | 0.81 | Very Good | I | VS1 | 62.7 | 58.0 | 2950 | 5.90 | 5.96 | 3.72 |
| 22520 | 1.50 | Very Good | H | VS2 | 61.6 | 55.0 | 10558 | 7.37 | 7.43 | 4.56 |
# Verify whether the dataset has NA values
df.isna().sum()
carat 0 cut 0 color 0 clarity 0 depth 0 table 0 price 0 x 0 y 0 z 0 dtype: int64
From the results, we can see that there are no NA values.
From the preceding visualisations, we noticed that there were some dimensionless diamonds in the dataset. That is why we will drop them.
sum_zero_x = df.loc[df['x'] == 0, 'x'].count()
sum_zero_y = df.loc[df['y'] == 0, 'y'].count()
sum_zero_z = df.loc[df['z'] == 0, 'z'].count()
print("count of elements equal to zero in x:", sum_zero_x)
print("count of elements equal to zero in y:", sum_zero_y)
print("count of elements equal to zero in z:", sum_zero_z)
count of elements equal to zero in x: 8 count of elements equal to zero in y: 7 count of elements equal to zero in z: 20
# Dropping dimensionless diamonds (zero x, y, or z)
df = df.drop(df[df["x"]==0].index)
df = df.drop(df[df["y"]==0].index)
df = df.drop(df[df["z"]==0].index)
#Dropping the outliers.
df = df[(df["depth"]<75)&(df["depth"]>45)]
df = df[(df["table"]<80)&(df["table"]>40)]
df = df[(df["x"]<30)]
df = df[(df["y"]<30)]
df = df[(df["z"]<30)&(df["z"]>2)]
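The five filters above can equivalently be combined into a single boolean mask, shown here as a sketch on a toy frame (the thresholds match the cell above; the rows are hypothetical):

```python
import pandas as pd

# Toy stand-in for the diamonds frame (hypothetical rows)
df = pd.DataFrame({
    "depth": [61.5, 44.0, 62.0],
    "table": [55.0, 57.0, 95.0],
    "x": [3.95, 4.00, 4.05],
    "y": [3.98, 4.02, 4.07],
    "z": [2.43, 2.50, 2.31],
})

# One pass over the frame with the same thresholds as the separate filters
mask = (
    df["depth"].between(45, 75, inclusive="neither")
    & df["table"].between(40, 80, inclusive="neither")
    & (df["x"] < 30)
    & (df["y"] < 30)
    & df["z"].between(2, 30, inclusive="neither")
)
df = df[mask]
print(len(df))  # only the first toy row passes every filter
```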
Encoding¶
The features cut, color and clarity are categorical and need to be mapped to integers. But first, we need to know the unique values of each feature.
unique1=df["cut"].unique()
unique2=df["color"].unique()
unique3=df["clarity"].unique()
unique1
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
unique2
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
unique3
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
dtype=object)
Now that we know the different unique values, we can replace them with numerical values.
# Define a custom mapping dictionary based on ranking values
cut_mapping = {
'Fair': 4, # Least Preferred
'Good': 3,
'Very Good': 2,
'Premium': 1,
'Ideal': 0 # Most Preferred
}
# Apply the custom mapping to the 'cut' column
df['cut'] = df['cut'].map(cut_mapping)
Using this approach ensures that the numerical values represent the ranking order of the 'cut' categories, meeting the specific requirement.
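An alternative worth noting (hypothetical, not what this notebook uses) is a pandas ordered Categorical, whose `.cat.codes` reproduces the same ranking without a hand-written dictionary:

```python
import pandas as pd

# Categories listed from most to least preferred, mirroring the mapping above
cut_order = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
cut = pd.Series(["Ideal", "Fair", "Good", "Premium"])

# .cat.codes assigns each value its position in the ordered category list
codes = cut.astype(pd.CategoricalDtype(categories=cut_order, ordered=True)).cat.codes
print(codes.tolist())
```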
# Define a custom mapping dictionary based on ranking values
clarity_mapping = {
'I1': 7, # Least Preferred
'SI2': 6,
'SI1': 5,
'VS2': 4,
'VS1': 3,
'VVS2': 2,
'VVS1': 1,
'IF': 0 # Most Preferred
}
# Apply the custom mapping to the 'clarity' column
df['clarity'] = df['clarity'].map(clarity_mapping)
This encoding maintains the ordinal relationship of the 'clarity' grades and allows to use these numerical values in the analysis.
# Define a custom mapping dictionary based on ranking values
color_mapping = {
'J': 6, # Least Preferred
'I': 5,
'H': 4,
'G': 3,
'F': 2,
'E': 1,
'D': 0 # Most Preferred
}
# Apply the custom mapping to the 'color' column
df['color'] = df['color'].map(color_mapping)
This encoding preserves the ordinal relationship of the 'color' grades, allowing to use these numerical values in the analysis.
df
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 0 | 1 | 6 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 1 | 1 | 5 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 3 | 1 | 3 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 1 | 5 | 4 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 3 | 6 | 6 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53938 | 0.86 | 1 | 4 | 6 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | 0 | 0 | 6 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
| 53940 | 0.71 | 1 | 1 | 5 | 60.5 | 55.0 | 2756 | 5.79 | 5.74 | 3.49 |
| 53941 | 0.71 | 1 | 2 | 5 | 59.8 | 62.0 | 2756 | 5.74 | 5.73 | 3.43 |
| 53942 | 0.70 | 2 | 1 | 4 | 60.5 | 59.0 | 2757 | 5.71 | 5.76 | 3.47 |
53910 rows × 10 columns
Linear regression¶
Preprocessing¶
Feature selection¶
correlations = df.corr()
plot = seaborn.heatmap(correlations, cbar=True, annot=True, fmt=".1f", vmin=-1)
The heatmap analysis paved the way for a thoughtful strategy in the subsequent linear regression. Noteworthy insights guiding our approach:
- Features like 'x', 'y', 'z' and 'carat' display strong positive correlations with price, indicating their collective impact on diamond pricing.
- The near-identical correlation coefficients of 'x', 'y' and 'z' suggest the dimensions play a cohesive role in predicting prices, which could steer our focus in the linear regression.
- 'Carat' shows the strongest correlation with price, making its dominance in pricing evident and emphasising the need to explore it comprehensively in subsequent linear modelling.
- 'Depth' and 'table' show much weaker correlations with price, so understanding their interplay is crucial for a nuanced pricing prediction.
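A compact way to read the same information off the matrix is to sort the price column of df.corr(). A sketch on a tiny hypothetical frame (with the real data the call would use the full diamonds frame):

```python
import pandas as pd

# Tiny hypothetical frame; with the real data this would be df.corr()['price']
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.9, 1.2],
    "depth": [61.0, 62.5, 60.0, 61.8],
    "price": [400, 1500, 4000, 6500],
})
ranking = toy.corr()["price"].sort_values(ascending=False)
print(ranking)  # price first (1.0), then carat, with depth trailing
```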
1- All features¶
First, we will start by training the model using all features of a diamond
features= ['depth', 'clarity','cut','table','color','x', 'y', 'z','carat']
target= "price"
X = df[features]
y = df[target]
Splitting into train/test¶
In preparation for building our model, we split the dataset into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,random_state=42)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 53910 observations, of which 43128 are now in the train set, and 10782 in the test set.
We have successfully divided our dataset into training and testing sets. The training set comprises 43,128 observations, which will be used to train our model, while the test set contains 10,782 observations, enabling us to assess the model's predictive capabilities on unseen data.
Modelling¶
We'll use the R² score, a key metric, to measure how well our model predicts diamond prices.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
result = model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.9087616369934168
With an R² score of 0.909, our model demonstrates strong predictive power, explaining approximately 91% of the variance in diamond prices. This high R² score signifies that our chosen features have been effective in capturing the nuances of diamond pricing. As we move forward, we will continue to assess different feature combinations to further refine our predictions.
Evaluation¶
In order to evaluate the effectiveness of our trained model, we compare the predicted diamond prices against the actual values from our test set.
predictions = model.predict(X_test)
prediction_overview = pandas.DataFrame()
prediction_overview["truth"] = y_test
prediction_overview["prediction"] = predictions
prediction_overview["difference"] = prediction_overview["truth"] - prediction_overview["prediction"]
prediction_overview["difference"] = abs(prediction_overview["difference"].astype(int))
prediction_overview = prediction_overview.reset_index(drop=True)
prediction_overview
| | truth | prediction | difference |
|---|---|---|---|
| 0 | 4669 | 6435.994975 | 1766 |
| 1 | 8653 | 11322.936664 | 2669 |
| 2 | 975 | 1096.992393 | 121 |
| 3 | 14474 | 10977.923292 | 3496 |
| 4 | 716 | 156.402876 | 559 |
| ... | ... | ... | ... |
| 10777 | 1439 | 2934.525315 | 1495 |
| 10778 | 6040 | 6255.182911 | 215 |
| 10779 | 2484 | 2792.777654 | 308 |
| 10780 | 1053 | 601.646564 | 451 |
| 10781 | 4921 | 4669.914091 | 251 |
10782 rows × 3 columns
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 22945 Root Mean Squared difference: 1192
The RMSE of 1192 indicates that predictions deviate from the true prices by roughly 1,200 on average, which is reasonable given the range of diamond prices. The maximum difference of 22945, however, shows that a few individual predictions miss by a very wide margin, pointing to outliers the model handles poorly.
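Alongside the max error and RMSE, the mean absolute error (MAE) is worth reporting because it is less dominated by outliers; a large gap between RMSE and MAE signals that a few big misses inflate the RMSE. A sketch with invented truth/prediction arrays (in the notebook these would be `y_test` and `predictions`):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented values, for illustration only
truth = np.array([326, 1500, 4000, 6500, 12000])
preds = np.array([400, 1400, 4300, 6000, 15000])

mae = mean_absolute_error(truth, preds)
rmse = np.sqrt(mean_squared_error(truth, preds))
# RMSE exceeds MAE whenever errors are unevenly distributed
print(f"MAE: {mae:.0f}, RMSE: {rmse:.0f}")
```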
The scatter plot contrasts the predicted and actual diamond prices. This visual insight supplements our numerical evaluation, making model performance and deviations easier to see; together, these aspects offer a comprehensive assessment of the model's accuracy and prediction patterns.
plot = seaborn.regplot(y=y_test.values.flatten(), x=predictions.flatten(), line_kws={"color": "r"})
plot.set_xlabel("predicted diamond price")
plot.set_ylabel("true diamond price")
plot
<Axes: xlabel='predicted diamond price', ylabel='true diamond price'>
Plot the residuals¶
plot = seaborn.residplot(y=y_test,x=predictions)
plot.set_xlabel('Predicted Values')
plot.set_ylabel('Residuals')
plot.set_title('Residual Plot')
plot
<Axes: title={'center': 'Residual Plot'}, xlabel='Predicted Values', ylabel='Residuals'>
The residual plot shows predominantly positive residuals for predicted values below 0, meaning the model's estimates there fall well short of the actual prices. Notably, there are outliers, especially for high-priced diamonds, suggesting areas for model improvement. The majority of points nevertheless cluster around the zero line, indicating a reasonably good fit with room for enhancement, particularly in handling high-value outliers.
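One common remedy for residuals that blow up at high prices is to fit the regression on log(price) and exponentiate the predictions, which compresses the high-value range. A hedged sketch on synthetic data (the notebook's real variables are not reused here):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price grows exponentially with carat, plus noise
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 2.5, 500)
price = np.exp(6 + 1.5 * carat + rng.normal(0, 0.1, 500))
X_log = pd.DataFrame({"carat": carat})

X_tr, X_te, y_tr, y_te = train_test_split(X_log, price, test_size=0.2, random_state=42)

# Fit on log(price); exponentiate predictions back to the price scale
log_model = LinearRegression().fit(X_tr, np.log(y_tr))
preds = np.exp(log_model.predict(X_te))
print("R² on the log scale:", log_model.score(X_te, np.log(y_te)))
```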
2- 4Cs of diamonds¶
features= [ 'clarity','cut','color','carat']
target= "price"
X = df[features]
y = df[target]
Splitting into train/test¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,random_state=42)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 53910 observations, of which 43128 are now in the train set, and 10782 in the test set.
Modelling¶
Now, let's evaluate the model's effectiveness.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
result = model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.9048849423757335
An R² value of 0.90 suggests that 90% of the variation in diamond prices can be explained by the 4Cs (carat, clarity, cut, and color). This indicates a strong relationship between these traditional diamond characteristics and their prices.
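Because the feature subsets differ in size, plain R² slightly favors the larger model; adjusted R² penalizes the predictor count and can make the comparison fairer. A sketch using the scores reported in this notebook (n taken as the 10,782-row test set):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R²: penalizes R² for using p predictors on n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 10782  # test-set size used throughout this notebook
print("All features:", adjusted_r2(0.9088, n, 9))
print("4 Cs        :", adjusted_r2(0.9049, n, 4))
```

With n this large the penalty is tiny, so the ranking of the two models is unchanged here, but the check costs nothing.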
Evaluation¶
predictions = model.predict(X_test)
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 19805 Root Mean Squared difference: 1217
With a maximum difference of 19805 and an RMSE of 1217, the model's predictions show a reasonable level of accuracy. While not as precise as the model with all features, these results indicate a relatively close match between predicted and actual prices.
plot = seaborn.regplot(y=y_test.values.flatten(), x=predictions.flatten(), line_kws={"color": "r"})
plot.set_xlabel("predicted diamond price")
plot.set_ylabel("true diamond price")
plot
<Axes: xlabel='predicted diamond price', ylabel='true diamond price'>
Plot the residuals¶
plot = seaborn.residplot(y=y_test,x=predictions)
plot.set_xlabel('Predicted Values')
plot.set_ylabel('Residuals')
plot.set_title('Residual Plot')
plot
<Axes: title={'center': 'Residual Plot'}, xlabel='Predicted Values', ylabel='Residuals'>
The residual plot for the '4Cs of diamonds' model displays predominantly positive residuals, signifying a systematic bias in the price estimates. There are notable outliers, particularly for high-priced diamonds. While most points cluster around the zero line, the model could be refined further, especially in handling high-value outliers.
3- Dimensions z,y and x¶
features= ['x', 'y', 'z']
target= "price"
X = df[features]
y = df[target]
Splitting into train/test¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,random_state=42)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 53910 observations, of which 43128 are now in the train set, and 10782 in the test set.
Modelling¶
from sklearn.linear_model import LinearRegression
model = LinearRegression()
result = model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.7885621800622987
An R² value of 0.79 means that about 79% of the variation in diamond prices is explained by the dimensions (x, y, and z). This indicates a moderate correlation between these features and diamond prices.
Evaluation¶
predictions = model.predict(X_test)
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 12155 Root Mean Squared difference: 1814
The maximum difference of 12155 and the RMSE of 1814 highlight a reasonable prediction accuracy. While not as accurate as the models considering 4Cs, these results still demonstrate a relatively good fit.
plot = seaborn.regplot(y=y_test.values.flatten(), x=predictions.flatten(), line_kws={"color": "r"})
plot.set_xlabel("predicted diamond price")
plot.set_ylabel("true diamond price")
plot
<Axes: xlabel='predicted diamond price', ylabel='true diamond price'>
Plot the residuals¶
plot = seaborn.residplot(y=y_test,x=predictions)
plot.set_xlabel('Predicted Values')
plot.set_ylabel('Residuals')
plot.set_title('Residual Plot')
plot
<Axes: title={'center': 'Residual Plot'}, xlabel='Predicted Values', ylabel='Residuals'>
The residual plot for the 'Dimensions z, y, and x' model shows a pattern where points between -2500 and 0 predominantly lie above the regression line, indicating an overestimation of diamond prices. Between 0 and 5000, there is a shift with more points falling below the line. Beyond 5000, there is a mix of points both above and below the line, suggesting the model's limitations in accurately predicting prices. Notably, there are outliers below the line between 5000 and 7500, and above the line from 14000, indicating challenges in handling specific diamond values.
4- Depth and Table Percentage¶
features= ['depth','table']
target= "price"
X = df[features]
y = df[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,random_state=42)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 53910 observations, of which 43128 are now in the train set, and 10782 in the test set.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
result = model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.017558648525327736
An R² of approximately 0.02 indicates that only about 2% of the variation in diamond prices can be explained by depth and table percentage. This suggests a very weak correlation between these features and diamond prices.
predictions = model.predict(X_test)
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 15395 Root Mean Squared difference: 3909
The large maximum difference of 15395 and the RMSE of 3909 indicate significant inaccuracies in the model's predictions. The model struggles to capture the nuances of diamond pricing when considering only depth and table percentage.
plot = seaborn.regplot(y=y_test.values.flatten(), x=predictions.flatten(), line_kws={"color": "r"})
plot.set_xlabel("predicted diamond price")
plot.set_ylabel("true diamond price")
plot
<Axes: xlabel='predicted diamond price', ylabel='true diamond price'>
Plot the residuals¶
plot = seaborn.residplot(y=y_test,x=predictions)
plot.set_xlabel('Predicted Values')
plot.set_ylabel('Residuals')
plot.set_title('Residual Plot')
plot
<Axes: title={'center': 'Residual Plot'}, xlabel='Predicted Values', ylabel='Residuals'>
In the 'Depth and Table Percentage' model, two points at around 800 and 2100 align closely with the regression line. However, beyond 3000, there are numerous residuals above the line, indicating challenges in accurately predicting prices for diamonds with higher prices.
5- Carat and dimensions¶
features= ['x', 'y', 'z','carat']
target= "price"
X = df[features]
y = df[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,random_state=42)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 53910 observations, of which 43128 are now in the train set, and 10782 in the test set.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
result = model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.8583639522566571
An R² value of 0.86 suggests that about 86% of the variation in diamond prices can be explained by carat alongside the dimensions (x, y, and z). This indicates a strong relationship between carat and dimensions with diamond prices.
predictions = model.predict(X_test)
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 23580 Root Mean Squared difference: 1485
With a maximum difference of 23580 and an RMSE of 1485, the model's predictions are reasonably accurate, though somewhat worse than the full-feature model (RMSE 1192). Considering carat alongside the dimensions offers a balanced approach, giving usable predictions without the complexity of all features.
plot = seaborn.regplot(y=y_test.values.flatten(), x=predictions.flatten(), line_kws={"color": "r"})
plot.set_xlabel("predicted diamond price")
plot.set_ylabel("true diamond price")
plot
<Axes: xlabel='predicted diamond price', ylabel='true diamond price'>
Plot the residuals¶
plot = seaborn.residplot(y=y_test,x=predictions)
plot.set_xlabel('Predicted Values')
plot.set_ylabel('Residuals')
plot.set_title('Residual Plot')
plot
<Axes: title={'center': 'Residual Plot'}, xlabel='Predicted Values', ylabel='Residuals'>
For the 'Carat and Dimensions' model, the residual plot indicates a relatively balanced distribution of points around the regression line, suggesting a reasonable fit for diamonds with carat and dimension features. However, there are notable outliers above the line at lower price points, around -5000 and 0, possibly indicating an overestimation of prices for smaller diamonds. Additionally, as prices increase beyond 12500, there are outliers below the line, indicating an underestimation of prices for larger and more valuable diamonds. These patterns highlight the model's struggle to accurately predict prices for extreme values.
Random Forest¶
Before moving to Random Forest, we explored feature importance using a Decision Tree. This simple model gives us a sneak peek into which features play a key role in predicting diamond prices besides carat. Let's see what the initial findings reveal.
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
Before delving into decision tree feature selection, it's essential to address the exclusion of carat as a feature. Carat is a well-established factor influencing diamond prices. The decision to focus on other features stems from a strategic choice to explore lesser-known contributors, aiming for a more comprehensive understanding of the intricate dynamics influencing diamond pricing.
# Features and target
features = ['clarity', 'color', 'y','z','x','depth','table']
target = 'price'
X = df[features]
y = df[target]
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Decision Tree for feature importance
decision_tree = DecisionTreeRegressor(random_state=42)
decision_tree.fit(X_train, y_train)
DecisionTreeRegressor(random_state=42)
Let's plot the feature importances.
plt.figure(figsize=(15, 6))
plot = seaborn.barplot(y=decision_tree.feature_importances_, x=features)
In creating an accurate model to predict diamond prices, the choice of features is crucial. After careful consideration, the features 'y', 'color', and 'clarity' were selected for their significant impact on diamond pricing. 'y' is the diamond's width, contributing to its overall dimensions; 'color' indicates the diamond's color grade; and 'clarity' assesses its internal flaws. These features were chosen based on their influence on diamond values, ensuring the precision and relevance of our pricing model.
Let's visualize the decision tree.
# Visualize the Decision Tree
plt.figure(figsize=(30, 20))
plot_tree(decision_tree, feature_names=features,fontsize=7, filled=True,max_depth=4)
plt.show()
Our decision tree analysis reveals that the width ('y') of a diamond is a major factor affecting its price. Additionally, the color and clarity of the diamond play crucial roles, each with specific thresholds. While other features matter, they are of secondary importance. This analysis guides our choice of features for the Random Forest model, making it a reliable predictor that considers the various aspects influencing diamond prices.
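Instead of eyeballing the bar plot, each feature can be paired with its importance and sorted. A sketch on synthetic data constructed so that 'y' dominates, mirroring the notebook's finding (the real fit would use `decision_tree.feature_importances_` directly):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in where 'y' (width) drives the target, as in the notebook
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "clarity": rng.integers(0, 8, 1000),
    "color": rng.integers(0, 7, 1000),
    "y": rng.uniform(3.7, 10.5, 1000),
})
demo_price = 1000 * demo["y"] + 150 * demo["color"] + 80 * demo["clarity"]

tree = DecisionTreeRegressor(random_state=42).fit(demo, demo_price)

# Pair each feature with its importance and sort in descending order
importances = pd.Series(tree.feature_importances_, index=demo.columns).sort_values(ascending=False)
print(importances)
```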
Random Forest Modeling with Chosen Features¶
features= ['clarity','color','y']
target= "price"
X = df[features]
y = df[target]
Splitting into train/test¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 53910 observations, of which 43128 are now in the train set, and 10782 in the test set.
Modelling¶
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor( n_jobs=-1, random_state=42)
random_forest.fit(X_train, y_train)
predictions_random = random_forest.predict(X_test)
R_square=random_forest.score(X_test, y_test)
print("R-squared score:", R_square)
R-squared score: 0.9691172480215086
The high R-squared score of 0.97 indicates that approximately 97% of the variance in diamond prices is explained by the features 'y', 'color', and 'clarity'. This suggests a strong correlation between these features and the actual prices.
Evaluation¶
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions_random)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions_random)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 6650 Root Mean Squared difference: 695
The model's predictions exhibited a maximum difference of 6650 from the actual prices, indicating some large individual errors. The root mean squared difference of 695 signifies good overall prediction accuracy, but highlights that significant outliers remain.
Hyperparameter Tuning¶
Next, we delve into the hyperparameter tuning process to optimize our Random Forest model further. By exploring different combinations, we aim to discover the ideal values for 'min_samples_split', 'min_samples_leaf', and 'bootstrap' that enhance the model's overall performance.
from sklearn.model_selection import GridSearchCV
param_grid = {
'min_samples_split': [2, 5, 10,15],
'min_samples_leaf': [1, 2, 4,6],
'bootstrap': [True, False]
}
random_forest = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(random_forest, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
Best Hyperparameters: {'bootstrap': True, 'min_samples_leaf': 6, 'min_samples_split': 15}
After optimizing the Random Forest model, the best hyperparameters were determined to be {'bootstrap': True, 'min_samples_leaf': 6, 'min_samples_split': 15}. These settings are expected to enhance the model's performance by refining the criteria for splitting nodes and controlling leaf node sizes. Enabling bootstrap sampling further adds robustness to the model.
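Rather than re-typing the best hyperparameters by hand, the refit model can be taken straight from the grid search via `best_estimator_`. A self-contained sketch on a toy regression problem (the parameter grid mirrors the one above; the data are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy regression problem, purely illustrative
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, (200, 3))
y_toy = X_toy @ np.array([3.0, 1.0, 0.5]) + rng.normal(0, 0.5, 200)

param_grid = {"min_samples_leaf": [1, 6], "min_samples_split": [2, 15]}
grid = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=42),
                    param_grid, cv=3, scoring="r2")
grid.fit(X_toy, y_toy)

# best_estimator_ is already refit on all the training data with the best settings
best_model = grid.best_estimator_
print("Best params:", grid.best_params_)
print("Best CV R²:", grid.best_score_)
```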
Modelling¶
The model's performance is assessed by examining the R-squared score and the maximum difference and root mean squared difference metrics
from sklearn.ensemble import RandomForestRegressor
optimized_random_forest = RandomForestRegressor(bootstrap= True, min_samples_leaf= 6, min_samples_split= 15, n_jobs=-1, random_state=42)
optimized_random_forest.fit(X_train, y_train)
predictions_optimized = optimized_random_forest.predict(X_test)
optimized_R_square=optimized_random_forest.score(X_test, y_test)
print("R-squared score:", optimized_R_square)
R-squared score: 0.9739439995303263
from sklearn.model_selection import cross_val_score
# Cross-validation
cv_scores = cross_val_score(optimized_random_forest, X_train, y_train, cv=5, scoring='r2')
print("Cross-Validation Scores:", cv_scores)
print("Mean R-squared Score:", cv_scores.mean())
Cross-Validation Scores: [0.97605937 0.97619518 0.9749442 0.97481567 0.97546035] Mean R-squared Score: 0.9754949558805827
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions_optimized)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions_optimized)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 6198 Root Mean Squared difference: 639
After optimizing the Random Forest model's hyperparameters, we observed a notable improvement in its predictive performance. The R-squared score increased to 0.97, indicating a better fit to the data. Additionally, the maximum difference and RMSE decreased to 6198 and 639, respectively, reflecting enhanced accuracy and precision in predicting diamond prices. The cross-validation results further validate the effectiveness of the hyperparameter-tuned random forest. The mean R-squared score of 0.98 across five folds indicates consistent and robust performance, suggesting that the model generalizes well.
Let's visualize the predicted prices against the actual prices.
import matplotlib.pyplot as plt
# Scatter plot
plt.scatter(y_test, predictions_optimized, alpha=0.5)
plt.title('Actual Prices vs Predicted Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()
In general, the points align closely, indicating a good fit between the model's predictions and the true prices. However, there is a notable outlier where the actual price is significantly higher than the predicted price, aligning with our earlier analysis of the maximum difference. This suggests a potential anomaly or specific condition not well captured by the model.
Boosting¶
Boosting is not typically applied on top of a random forest, but it is still worth trying to see how it performs.
from sklearn.ensemble import AdaBoostRegressor
# optimized random forest as the base model
base_model = optimized_random_forest
boosted_model = AdaBoostRegressor(base_model, learning_rate=0.1, random_state=42)
boosted_model.fit(X_train, y_train)
# Predictions
predictions_boosted = boosted_model.predict(X_test)
# Evaluate the model
score_boosted = boosted_model.score(X_test, y_test)
print("Boosted Random Forest Model R-squared score:", score_boosted)
Boosted Random Forest Model R-squared score: 0.9726271244653908
import math
from sklearn.metrics import max_error
from sklearn.metrics import mean_squared_error
me = max_error(y_test, predictions_boosted)
me = math.ceil(me)
print("Max difference:", me)
mse = mean_squared_error(y_test, predictions_boosted)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared difference:", rmse)
Max difference: 6467 Root Mean Squared difference: 655
The Boosted Random Forest model performed well, with an R-squared score of 0.97 and a root mean squared difference of 655. While this demonstrates strong predictive ability, the optimized Random Forest model still outperformed it slightly; the optimization process significantly enhanced the model's accuracy in estimating diamond prices.
I have chosen to focus on the optimized model. It is the best version of our prediction tool and gives the most accurate diamond price predictions based on clarity, color, and the width ('y').
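To reuse the chosen model outside the notebook, for instance in the demonstration sessions, it could be persisted with joblib (a dependency that ships with scikit-learn installs). A sketch with a stand-in model; in the notebook, `optimized_random_forest` would be dumped instead:

```python
import os
import tempfile
import joblib  # installed alongside scikit-learn
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model trained on invented data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, (100, 3))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Persist and reload, so predictions can be served outside the notebook
path = os.path.join(tempfile.mkdtemp(), "diamond_price_model.joblib")
joblib.dump(model, path)
reloaded = joblib.load(path)
print("Round-trip OK:", np.allclose(model.predict(X_demo), reloaded.predict(X_demo)))
```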
Inference¶
Now that we have trained the model, let's try it!
Guide:
CLARITY:
'I1': 7, # Least Preferred
'SI2': 6,
'SI1': 5,
'VS2': 4,
'VS1': 3,
'VVS2': 2,
'VVS1': 1,
'IF': 0 # Most Preferred
COLOR:
'J': 6, # Least Preferred
'I': 5,
'H': 4,
'G': 3,
'F': 2,
'E': 1,
'D': 0 # Most Preferred
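The guide above can also be encoded as dictionaries so that users type grade labels (e.g. 'VS1', 'G') instead of numeric codes. A hypothetical helper; the function name and map names are illustrative, with the values taken from the guide:

```python
# Numeric codes copied from the guide above; keys are GIA grade labels
CLARITY_MAP = {"IF": 0, "VVS1": 1, "VVS2": 2, "VS1": 3,
               "VS2": 4, "SI1": 5, "SI2": 6, "I1": 7}
COLOR_MAP = {"D": 0, "E": 1, "F": 2, "G": 3, "H": 4, "I": 5, "J": 6}

def encode_grades(clarity_label: str, color_label: str) -> tuple:
    """Translate grade labels to the numeric codes the model expects."""
    return CLARITY_MAP[clarity_label.upper()], COLOR_MAP[color_label.upper()]

print(encode_grades("VS1", "G"))  # → (3, 3)
```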
import pandas as pd
model = optimized_random_forest
def predict_diamond_price():
    try:
        clarity = float(input("Enter clarity (0-7): "))
        color = float(input("Enter color (0-6): "))
        y = float(input("Enter Y dimension (3.68-10.54): "))
        if 0 <= clarity <= 7 and 0 <= color <= 6 and 3.68 <= y <= 10.54:
            input_data = pd.DataFrame({"clarity": [clarity], "color": [color], "y": [y]})
            price_prediction = model.predict(input_data)[0]
            error_margin_percentage = 1
            # Calculate error margin
            error_margin = price_prediction * (error_margin_percentage / 100)
            # Calculate price range
            lower_bound = price_prediction - error_margin
            upper_bound = price_prediction + error_margin
            print(f"Predicted Diamond Price Range: {lower_bound:.2f} to {upper_bound:.2f}")
        else:
            print("Invalid input. Please enter values within the specified ranges.")
    except ValueError:
        print("Invalid input. Please enter numeric values.")

predict_diamond_price()
Predicted Diamond Price Range: 1040.04 to 1061.05
Demonstrations¶
Stakeholders¶
Stakeholder 1: Future Diamond Buyer¶
Background¶
Stakeholder 1, representing a prospective diamond buyer, entered the demonstration with limited knowledge about diamond pricing. This individual was seeking to explore the price ranges of diamonds and gain a better understanding of how specific characteristics affect pricing.
Number of Intakes and Duration¶
Stakeholder 1 engaged in two separate sessions to explore different scenarios. The sessions lasted around 5 and 2 minutes respectively, including the time spent reviewing the guide and considering input values.
First Intake¶
In the initial session, the stakeholder aimed to understand the cost of an average-quality diamond. Input values of clarity (4), color (4), and Y dimension (10) were chosen for this exploration. The model predicted a price range of 16,786 to 17,125.16. Stakeholder 1 expressed surprise at the price point and curiosity about the model's accuracy, as he did not know diamonds could be that expensive.
Model Accuracy Questioning¶
During the discussion of model accuracy, Stakeholder 1 initially had reservations. However, as the model's workings were explained, the stakeholder's confidence in the model's accuracy increased. The educational aspect of the demonstration played a key role in building trust.
Second Intake¶
For the second session, the stakeholder adjusted the input values to reflect preferences for a more budget-friendly diamond. With clarity (6), color (6), and Y dimension (4), the model predicted a range of 387.3 to 395.13. Stakeholder 1 carefully chose these values to align with a specific budget, showcasing an additional use case for the model.
Stakeholder Comments¶
Throughout both intakes, Stakeholder 1 consistently referenced the provided guide to interpret the clarity and color values. Expressing curiosity about the accuracy of the model, this stakeholder appreciated the educational aspect of the demonstration.
Stakeholder 2: Senegalese Jeweler¶
Background¶
Stakeholder 2, an experienced Senegalese jeweler, joined the demonstration with a strong background in diamond knowledge. This individual aimed to validate the model's predictions based on their expertise in the field.
Number of Intakes and Duration¶
Stakeholder 2 participated in a single session, and the entire process took approximately 2-3 minutes. The quick duration was influenced by the stakeholder's familiarity with diamond pricing concepts.
Intake¶
In the sole session, the stakeholder entered values for clarity (5), color (3), and Y dimension (5). The model swiftly generated a prediction ranging from 1,125.03 to 1,146.76. Stakeholder 2 promptly confirmed the predictions using the guide and expressed confidence in the accuracy of the model.
Stakeholder Comments¶
The experienced jeweler found the model's predictions aligned with his expectations (before entering the values, he said the price should normally not exceed around 1,200 or 1,300). Stakeholder 2 particularly appreciated the efficiency of the model and its ability to provide accurate predictions promptly.
Demonstration Process¶
The demonstration process was structured to provide stakeholders with a seamless experience. A guide detailing clarity and color equivalents was made available to assist stakeholders in making informed choices. Stakeholders interacted with the Python notebook, inputting values and receiving detailed predicted price ranges.
Feedback¶
Common Feedback Themes¶
Model Value¶
Both stakeholders acknowledged the value of the model's quick and accurate predictions. Stakeholder 1 found the demonstration to be an educational experience, helping in understanding the intricacies of diamond pricing.
Efficiency¶
Stakeholder 2 highlighted the efficiency of the model, emphasizing its quick response time. The rapid generation of predictions within 2 seconds contributed to a streamlined and effective demonstration.
Duration Analysis¶
While Stakeholder 1 invested more time in exploring different scenarios, Stakeholder 2 had time constraints. The efficient nature of the model allowed for quick interactions, catering to stakeholders with varying schedules.
Stakeholder 1: Future Diamond Buyer¶
Overview¶
Stakeholder 1, a prospective diamond buyer with limited prior knowledge, provided valuable feedback following the demonstration. The insights shared by this stakeholder contribute to the overall evaluation of the model and its potential for real-world applications.
Detailed Feedback¶
-Clarity and Explanation: Stakeholder 1 appreciated the detailed and well-explained guide provided at the beginning of the demonstration. The guide, specifically designed for individuals with limited knowledge about diamonds, was found to be beneficial in understanding the pricing factors.
-Feature Selection: The stakeholder liked the choice of features in the model, highlighting that the inclusion of only three features streamlined the input process. This decision was seen as practical to prevent the input of numerous values and ensure a quick prediction.
-Confidence in Model: Expressing confidence in the model's efficiency, Stakeholder 1 mentioned being willing to use the application in the future when purchasing a diamond. The stakeholder tested the model's accuracy by trying two different diamond price ranges, reinforcing trust in its capabilities.
-Accuracy Acknowledgment: Upon learning that the model is not 100% accurate but approximately 97%, Stakeholder 1 found this level of accuracy acceptable. The stakeholder drew a parallel with weather forecasting apps, acknowledging that not all machines can achieve perfect accuracy.
Suggestions for Improvement¶
Suggestion for Transparency: The stakeholder advised including a sentence in the application that communicates the possibility of occasional errors in pricing. This transparency, according to Stakeholder 1, would set realistic expectations for users.
Stakeholder 2: Senegalese jeweler¶
Overview¶
Stakeholder 2, a Senegalese jeweler, provided valuable insights into the model's performance, highlighting its efficiency and accuracy in predicting diamond prices.
Detailed Feedback¶
Efficiency and Accuracy Acknowledgment: Stakeholder 2 expressed satisfaction with the model's efficiency and accuracy. The quick response time and precise predictions were particularly appreciated, making the tool valuable for quick assessments.
Enthusiasm for Extended Usage: Although the demonstration covered a specific scenario, the stakeholder indicated a desire to engage in further sessions to explore other aspects of the model. This positive feedback suggests potential for extended and more diverse usage scenarios beyond the initial demonstration.
Simplicity in Feature Selection: Stakeholder 2 appreciated the model's simplicity, highlighting the choice of only three features. This streamlined approach made the interaction more straightforward and saved time. The positive feedback suggests that the minimalistic feature selection was a valuable aspect of the model's design.
International Usefulness: Given the stakeholder's expertise as a jeweler, the tool could be particularly useful when buying diamonds abroad, since quickly assessing the likely price of diamonds in different markets would be a valuable feature.
Pricing for Selling Diamonds: The jeweler could also benefit from the tool when pricing their own diamonds before selling, using it as a quick reference for setting a competitive and fair selling price based on current market trends.
Suggestions for Improvement¶
Transparency in Limitations: Incorporating a section that transparently communicates the tool's limitations would ensure users are aware of the scenarios in which the model may be less accurate.
TICT¶
In the following section, a quick overview of the Technology Impact and Contextualization Tool (TICT) is presented. Two key snapshots capture the quick scan and subsequent improvements, providing valuable insights into the tool's assessment of the project's technological impact and its context.
Conclusion¶
This project aimed to understand diamond pricing using data analysis and an AI model. We explored data, built a predictive model focusing on clarity, color, and the Y dimension, and engaged with stakeholders to test its practicality.
The model, while quick and versatile, isn't foolproof. Feedback from potential buyers and a jeweler showed its usefulness but also highlighted limitations. The AI model requires specific inputs—clarity, color, and the Y dimension—to provide accurate predictions.
This project demonstrates the real-world application of AI in diamond valuation. The chosen features capture much of what drives a diamond's value. While celebrating the progress made, we must also recognize the model's limitations and the complexity of diamond pricing.
Because the model relies solely on clarity, color, and the Y dimension, users must supply all three values for it to generate meaningful and reliable price estimates. This deliberate restriction keeps the A.I. focused on the features it handles best and encourages users to provide the essential information for optimal results.
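As a rough illustration of how a price-range predictor built on these three inputs might look, the sketch below trains a small scikit-learn classifier on clarity, color, and the y dimension. This is a minimal hypothetical sketch, not the project's actual model: the toy training rows, the `RandomForestClassifier` choice, and the `low`/`mid`/`high` range labels are all assumptions made for the example; only the three feature names come from the report.

```python
# Hypothetical sketch of a price-range model using the project's three
# features: clarity, color, and the y dimension. Toy data for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Ordered grade scales (worst to best), following GIA conventions.
CLARITY = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
COLOR = ["J", "I", "H", "G", "F", "E", "D"]

def build_model():
    # Encode the two ordinal grades; pass the numeric y dimension through.
    encode = ColumnTransformer(
        [("grades", OrdinalEncoder(categories=[CLARITY, COLOR]),
          ["clarity", "color"])],
        remainder="passthrough",
    )
    return Pipeline([("encode", encode),
                     ("clf", RandomForestClassifier(random_state=0))])

# Toy training rows; the real model would use the full diamonds dataset.
train = pd.DataFrame({
    "clarity": ["SI1", "VS2", "IF", "I1"],
    "color":   ["H", "G", "D", "J"],
    "y":       [4.0, 5.5, 7.2, 3.8],
    "range":   ["low", "mid", "high", "low"],
})
model = build_model().fit(train[["clarity", "color", "y"]], train["range"])

# A query must supply all three features, as the report emphasizes.
query = pd.DataFrame({"clarity": ["VS1"], "color": ["F"], "y": [5.6]})
print(model.predict(query)[0])
```

The pipeline structure also makes the input requirement explicit: a query missing any of the three columns raises an error rather than silently producing an unreliable estimate.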
Looking ahead, the goal isn't just predicting prices but offering insights for better decision-making in the diamond market. The journey continues, aiming for ongoing improvements in AI-driven diamond valuation.